# Define the graph theme custom function in order to standardize the graph analysis
theme_wealth <- function(){
theme_gray() +
theme(plot.caption = element_text(hjust = 0, face = "italic"),
plot.title = element_text(hjust = 0.5, color = "blue3", size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 12),
panel.grid.minor = element_blank())
}The Dataset Description
The dataset comes from the US Government Bureau Census and highlights some wealth summary statistics by gender and social ethnic group over the past 50 years. The data is collected annually through a national survey that includes a probability sample from all the non-institutional American civilians and a few members of the US Armed Forces. It’s important to mention that wealth is a pretty generic concept and the original raw data is compounded by a different range of separated data tables with divergent specifications like Student Debts, retirement savings, incomes, etc. With that in hands, the idea of this following experiment is to examine those specifications and provide a clear perception of the wealth equality or inequality based on gender or social ethnic group.
As mentioned above, the dataset is made up by a different set of data tables with a vary range of wealth possible indicators that include the income distribution, retirement savings, student debts and house ownership by ethnic groups and year. Each data table deserve a special treatment their respective analysis and this experiment will evaluate one-by-one.
Income Distribution
In order to reach a good overview regarding the wealth structure within a population, lots of factors should be considered. However, undoubtedly, the average income per household is a very important or “the most” important aspect to consider and it’s with this factor that the wealth analysis is going to start. The income distribution data table contains the mean and median values from 1967 to 2019 of the following ethnic groups: Asian, Hispanic, Black and White; still those groups are not well-clearly divided. Due to this reason, some cleaning up stages are necessary to end up with the ones used in this experiment: Hispanic, White and Black. Obviously the Asian group could also be considered but it’s not present in the other data tables. Also, in the cleaning, the years considered were higher than 1971 due to the lack of Hispanic data for the previous years (1967-1971).
# Cleaning
income_dist <- data.table(tuesdata$income_distribution)
income_dist <- income_dist[race %in% c('White Alone' , 'Black Alone','Hispanic (Any Race)')][year > 1971]
# Order income_bracket column
income_dist$income_bracket <- factor(income_dist$income_bracket , levels = c('Under $15,000',
'$15,000 to $24,999',
'$25,000 to $34,999',
'$35,000 to $49,999',
'$50,000 to $74,999',
'$75,000 to $99,999',
'$100,000 to $149,999',
'$150,000 to $199,999',
'$200,000 and over'))
# Rename ethnic group names
income_dist$race <- gsub('(Any Race)', '', income_dist$race)
income_dist$race <- gsub('[[:punct:]]', '', income_dist$race)
income_dist$race <- gsub('Alone', '', income_dist$race)
income_dist_agg <- income_dist[ , list(income_mean = mean(income_mean),
income_median = mean(income_median),
income_med_moe = mean(income_med_moe),
income_mean_moe = mean(income_mean_moe),
number = mean(number)), by = list(race, year)]Income mean along the years by group
IC_GH1 <- ggplot(income_dist_agg, aes(x = year , y = income_mean , color = race)) +
geom_line() + geom_point() +
transition_reveal(year) +
labs(title = 'Income Distribuition',
subtitle = 'Total money income of householders separated by year and income groups',
caption = 'Data source: US Census Bureau',
y = 'Average Income by Househould (USD)',
x = 'Year') +
scale_x_continuous(limits = c(1972,2019), breaks = seq(1972,2019, by = 5)) +
scale_y_continuous(breaks = seq(0,105000, by = 15000) , labels = dollar) +
theme_wealth() +
theme(legend.position = "top" , legend.title = element_blank())
animate(IC_GH1, end_pause = 10, fps = 5)In this first graph, the income distribution by ethnic group and year, a few disparities can be found between the racial groups with a high advantage to the White population. Although this initial gap exists, it’s really interesting to highlight that the data follows the same trend during the years for all the ethnic groups without any specific growth pattern between them.
Income distribution by group
IC_GH2 <- ggplot(income_dist, aes(x = income_bracket, y = income_distribution , fill = race)) +
geom_bar(stat = 'identity' ) +
transition_states(year) +
labs(title = 'Income Distribuition separated by category and ethnic groups',
subtitle = 'Year: {closest_state}',
caption = 'Data source: US Census Bureau',
y = 'Frequency (%)') +
theme_wealth() +
theme(legend.position = "none" , axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_wrap(~race)
animate(IC_GH2, end_pause = 10, fps = 5)On this second graph, Hispanic and White groups are turning their income distribution into a more “normal” shape. where it´s possible to check a movement from Hispanics at the “poor” class to the “middle” class and the White population migrating from the middle to the upper one. Considering now the Black population, the inequalities perceived between its own group doesn’t get much better with time like in the other ethnics. The pattern observed is still a highly skewed right-tailed distribution with really few people in the upper class.
Home ownership
The second dataset deals about another important wealth indicator, the home ownership. The following contains the information regarding the percent of house owned by ethnic group and year from 1976 until 2016.
house_ownership <- data.table(tuesdata$home_owner)
house_ownership[, home_owner_pct:= round(home_owner_pct * 100,2)]
HO_GH1 <- ggplot(house_ownership, aes(x = year, y = home_owner_pct, color = race)) +
geom_point() +
geom_smooth(method = 'lm', se = TRUE) +
theme_wealth() +
transition_states(race) +
labs(title = 'Home Ownership',
subtitle = 'Ethnic Group: {closest_state}',
y = 'Percent of Home Ownership by Families (%)') +
theme(legend.position = "top" , legend.title = element_blank())
animate(HO_GH1, end_pause = 10, fps = 5)On this third graph similar head start disparities can be checked between the groups wih the Whites being the ones in the lead, like it was observed in the income distribution. In addition, along the years Whites and Hispanic People, demonstrated a positive growth trend in terms of percentage of the families with their private houses,however same aspect can’t be seen into the Black community. There is no clear growth trend for this last ethnic group.
Student Debt
Moving along with the social wealth indicators, it’s time to check the average student loan debts. This indicator measures the average family student loan debt for aged 25-55 by race and year.
student_debt <- tuesdata$student_debt
# Average Student Loan Debt per family
SD_GH1 <- ggplot(student_debt, aes(x = year , y = loan_debt , color = race)) +
geom_line() + geom_point() +
transition_reveal(year) +
labs(title = 'Student Loan Debt',
subtitle = 'Average family student loan debt by ethnic group and year',
caption = 'Data source: US Census Bureau',
y = 'Average Student Loan Debt per Family (USD)',
x = 'Year') +
scale_x_continuous(limits = c(1989,2016), breaks = seq(1989,2016, by = 3)) +
scale_y_continuous(limits = c(0,15000), breaks = seq(0,15000, by = 5000) , labels = dollar) +
theme_wealth() +
theme(legend.position = "top" , legend.title = element_blank())
animate(SD_GH1, end_pause = 10, fps = 5)Following the same patterns observed in the past graphs, discrepancies are observed between the groups. But differently for the previous cases, Hispanics have an advantage on this aspect because their debt growth rate is indeed really lower than the ones observed in White and Black populations. Until 2000, the student debt rates were similar between those groups and after this period the Black and White rates sky-rocketed.
Retirement Savings
The final social indicator is the total retirement savings done by each one of those groups distributed by year. The numbers are considering the year average family liquid retirement savings in US dollars.
retirement <- tuesdata$retirement
RE_GH1 <- ggplot(retirement, aes(x = race, y = retirement, fill = race)) +
geom_bar(stat = 'identity') +
theme_wealth() +
transition_states(year) +
labs(title = 'Average family liquid retirement savings',
subtitle = 'Year: {closest_state}',
y = 'Average Retirment Savings by Families (USD)') +
theme(legend.position = "top" , legend.title = element_blank())
animate(RE_GH1, end_pause = 10, fps = 5)Under this retirement savings graph, as logically expected, not only discrepancies exist but also Whites assume the lead in terms of the amount of savings. However, different from what was observed in the other past analysis, that was the first time that it could be observed a greater rate increase into the white group compared to the other ones. This probably could be explained as the reflect of the better results obtained into the other social indicators.
Final Analysis
Using a method called Multidimensional Scaling, all of those social indicators were brought together. And according to their specific social indicator results, the ethnic groups could be spatially distributed based on similar characteristics, having the ones with closer social aspects, also physically closer in terms of coordinates.
joined_dataset <- data.table(income_dist_agg %>%
inner_join(house_ownership, by = c('race','year')) %>%
inner_join(retirement, by = c('race','year')) %>%
inner_join(student_debt, by = c('race', 'year')))
#Remove columns that might be bringing some bias
joined_dataset[, c("income_med_moe", "income_mean_moe" , "number"):= NULL]saveGIF(
for (i in sort(unique(joined_dataset$year))){
data_subset <- joined_dataset[year == i]
rownames(data_subset) <- data_subset$race
data_subset$race <- NULL
mds <-as.data.frame(cmdscale(dist(scale(data_subset))))
mds$race <- rownames(data_subset)
p <- ggplot(mds, aes(V1, V2, label = race)) +
labs(title = 'CMD Scaling by Ethnic Group',
subtitle = paste0('Year: ', i),
caption = 'Data source: US Census Bureau') +
xlim(-4,4) + ylim(-4,4) +
geom_text_repel() + theme_wealth()
plot(p)
}
)## Output at: animation.gif
Conclusion
Having a quick look at the graphs, it’s possible to observe that although the White people have a considerable better initial stage in all categories (income, retirement savings, etc.) there is no clear conclusion about any sort of discrimination regarding the other ethnic groups. This could be explained due to the similar growth rates observed and social indicators seen into the Hispanic group. Although the difference in absolute values can be significant, there was a really closer improvement observed into the social indicators between White and Hispanics.